Abstract. Let s be a string whose symbols are solely available through
access(i), a read-only operation that probes s and returns the symbol
at position i in s. Many compressed data structures for strings, trees,
and graphs require two kinds of queries on s: select(c, j), returning the
position in s containing the jth occurrence of c, and rank(c, p), counting
how many occurrences of c are found in the first p positions of s. We
give matching upper and lower bounds for this problem, improving the
lower bounds given by Golynski [Theor. Comput. Sci. 387 (2007)] [PhD
thesis] and the upper bounds of Barbay et al. [SODA 2007]. We also
present new results in another model, improving on Barbay et al. [SODA
2007] and matching a lower bound of Golynski [SODA 2009]. The main
contribution of this paper is a general technique for proving
lower bounds on succinct data structures that is based on the access
patterns of the supported operations and abstracts from the particular
operations at hand. For this reason, it may find application to other
interesting problems on succinct data structures.

1 Introduction 

We are given a read-only sequence $s = s[0, n-1]$ of n symbols over an integer
alphabet $\Sigma = [\sigma] = \{0, 1, \ldots, \sigma-1\}$, where $2 \le \sigma \le n$. The symbols in s can
be read using access(i), for $0 \le i \le n-1$: this primitive probes s and returns
the symbol at position i, denoted by s[i]. Given the sequence s, its length n,
and the alphabet size $\sigma$, we want to support the following query operations for
a symbol $c \in \Sigma$:

- select(c, j): return the position inside s containing the jth occurrence of
symbol c, or $-1$ if that occurrence does not exist;

- rank(c, p): count how many occurrences of c are found in $s[0, p-1]$.
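The two operations can be pinned down with a minimal reference sketch that uses nothing but access. It assumes 0-based positions and, for select, that occurrences are counted from j = 1; that convention and the helper names are assumptions made for illustration only.

```python
def access(s, i):
    """Read-only probe: return the symbol stored at position i of s."""
    return s[i]

def rank(s, c, p):
    """Count occurrences of c in s[0, p-1], probing each position once."""
    return sum(1 for i in range(p) if access(s, i) == c)

def select(s, c, j):
    """Position of the j-th occurrence of c (j = 1, 2, ...), or -1."""
    seen = 0
    for i in range(len(s)):
        if access(s, i) == c:
            seen += 1
            if seen == j:
                return i
    return -1

s = [2, 0, 1, 0, 2, 0]
print(rank(s, 0, 4))    # occurrences of 0 in s[0, 3]
print(select(s, 0, 3))  # position of the third 0
print(select(s, 1, 2))  # symbol 1 occurs only once
```

Both functions scan s, so they serve only as a specification of the expected answers, not as an efficient index.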

We postulate that an auxiliary data structure, called a succinct index, is 
constructed in a preprocessing step to help answer these queries rapidly. In this 
paper, we study the natural and fundamental time-space tradeoff between two 
parameters t and r for this problem: 

- t = the probe complexity, which is the maximal number of probes to s (i.e.
calls to access) that the succinct index makes when answering a query^3;

3 The time complexity of our results in the RAM model with logarithmic-sized words 
is linearly proportional to the probe complexity. Hence, we focus on the latter. 



- r = the redundancy, which is the number of bits required by the succinct 
index, and does not include the space needed to represent s itself. 

Clearly, these queries can be answered in negligible space but $O(n)$ probes by
scanning s, or in zero probes by making a copy of s in auxiliary memory at
preprocessing time, but with redundancy of $O(n \log\sigma)$ bits. We are interested
in succinct indices that use few probes, and have redundancy $o(n \log\sigma)$, i.e.,
asymptotically smaller than the space for s itself. Specifically, we obtain upper
and lower bounds on the redundancy $r = r(t, n, \sigma)$, viewed as a function of the
maximum number t of probes, the length n of s, and the alphabet size $\sigma$. We
assume that $t > 0$ in the rest of the paper.
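The two extremes mentioned above can be made concrete with a small probe-counting sketch. The class names `ProbedString` and `CopyIndex` are hypothetical, and the redundancy of the copy is charged as $n \lceil \log\sigma \rceil$ bits, which is one simple way to account for it.

```python
import math

class ProbedString:
    """Wraps s so that every call to access counts as one probe."""
    def __init__(self, symbols):
        self._symbols = list(symbols)
        self.probes = 0
    def __len__(self):
        return len(self._symbols)
    def access(self, i):
        self.probes += 1
        return self._symbols[i]

def rank_by_scanning(ps, c, p):
    """No index at all (r = 0 bits): p probes for rank(c, p)."""
    return sum(1 for i in range(p) if ps.access(i) == c)

class CopyIndex:
    """Index holding a full copy of s: zero probes at query time."""
    def __init__(self, ps, sigma):
        n = len(ps)
        self.copy = [ps.access(i) for i in range(n)]  # preprocessing only
        self.redundancy_bits = n * math.ceil(math.log2(sigma))
    def rank(self, c, p):
        return sum(1 for x in self.copy[:p] if x == c)

sigma = 4
ps = ProbedString([3, 1, 0, 1, 2, 1, 3, 0])
index = CopyIndex(ps, sigma)

ps.probes = 0                      # count query-time probes only
a = rank_by_scanning(ps, 1, 6)
scan_probes = ps.probes

ps.probes = 0
b = index.rank(1, 6)
copy_probes = ps.probes
print(a, scan_probes, b, copy_probes, index.redundancy_bits)
```

The interesting regime studied in the paper lies between these two endpoints, where both t and r are small.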

Motivation. Succinct indices have numerous applications to problems involving
indexing massive data sets [1]. The rank and select operations are basic
primitives at the heart of many sophisticated indexing data structures for strings,
trees, graphs, and sets [12]. Their efficiency is crucial to make these indexes fast
and space-economical. Our results are most interesting for the case of "large"
alphabets, where $\sigma$ is a not-too-slowly growing function of n. Large alphabets are
common in modern applications: e.g. many files are in Unicode character sets,
where $\sigma$ is of the order of hundreds or thousands. Inverted lists or documents
in information retrieval systems can be seen as sequences s of words, where the
alphabet $\Sigma$ is obviously large and increasing with the size of the collection (it is
the vocabulary of distinct words appearing over the entire document repository).

Our results. Our first contribution is showing that the redundancy r in bits

    $r(t, n, \sigma) = \Theta\left(\frac{n \log\sigma}{t}\right)$    (1)

is tight for any succinct index solving our problem, for $t = O(\log\sigma/\log\log\sigma)$.
(All the logarithms in this paper are to the base 2.) We provide matching upper
and lower bounds for this range of values of t, under the assumption that $O(t)$
probes are allowed for rank and select, i.e. we ignore multiplicative constant
factors. The result is composed of a lower bound of $r = \Omega(\frac{n \log\sigma}{t})$ bits that holds
for $t = o(\log\sigma)$ and an upper bound of $r = O(\frac{n \log\sigma}{t} + n \log\log\sigma)$. We also
provide a lower bound of $r = \Omega(\frac{n \log t}{t})$ for $t = O(n)$, thus leaving open what the
optimal redundancy is when $t = \Omega(\frac{\log\sigma}{\log\log\sigma})$. Running times for the upper bound
are $O(t + \log\log\sigma)$ for rank and $O(t)$ for select.

An interpretation of (1) is that, given a data collection D, if we want to
build an additional succinct index on D that saves space by a factor t over that
taken by D, we have to pay $\Omega(t)$ access cost for the supported queries. Note
that the plain storage of the sequence s itself requires $n \log\sigma$ bits. Moreover, our
result shows that it is suboptimal to build $\sigma$ individual succinct indexes (like
those for the binary-alphabet case, e.g. [19]), one per symbol $c \in [\sigma]$: the latter
approach has redundancy $\Theta(\frac{\sigma n \log t}{t})$ while the optimal redundancy is given in
eq. (1), when $t = O(\log\sigma/\log\log\sigma)$.
Lower bounds are our main findings, while the matching upper bounds are
derived from known algorithmic techniques. Thus, our second contribution is
a general technique that extends the algorithmic encoding/decoding approach
in [5] in the sense that it abstracts from the specific query operation at hand and
focuses solely on its access pattern. For this, we can single out a sufficiently large,
conflict free subset of the queries that are classified as stumbling or z-unique. In
the former case, we extract direct knowledge from the probed locations; in the
latter, the novelty of our approach is that we can extract (implicit) knowledge
also from the unprobed locations. We are careful not to exploit the specific
semantics of the query operations at this stage. As a result, our technique applies
to other kinds of query operations for predecessor, prefix sum, permutation, and
pattern searching problems, to name a few, as long as we can extract a
sufficiently large subset of the queries with the aforementioned features. We will
discuss them extensively in the full version.

We also provide further running times for the rank/select problem. For
example, if $\sigma = (\log n)^{O(1)}$, the rank operation requires only $O(t)$ time; also, we can
get $O(t \log\log\sigma \log^{(3)}\sigma)$ time^4 for rank and $O(t \log\log\sigma)$ time for select
(Theorem 5). We also have a lower bound of $r = \Omega(\frac{n \log t}{t})$ bits for the redundancy
when $1 \le t \le n/2$, which leaves open what the optimal redundancy is when
$t = \Omega(\log\sigma)$. As a corollary, we can obtain an entropy-compressed data
structure that represents s using $nH_k(s) + O(\frac{n \log\sigma}{\log\log\sigma})$ bits, for any $k = o(\frac{\log_\sigma n}{\log\log\sigma})$,
supporting access in $O(1)$ time, and rank and select in $O(\log\log\sigma)$ time (here,
$H_k(s)$ is the kth-order empirical entropy).

Related work. Succinct data structures are generally divided into two kinds,
systematic and non-systematic [7]. Non-systematic data structures encode the
input data and any auxiliary information together in a single representation,
while systematic ones do not. The concept of succinct indexes applies to systematic
ones; moreover, the concept of probes does not apply to non-systematic ones,
since the input data is not distinguished from the index. In terms of time-space
trade-off, our results extend the complexity gap between systematic and
non-systematic succinct data structures (which was known for $\sigma = 2$) to any integer
alphabet of size $\sigma \le n$. This is easily seen by considering the case of $O(1)$
time/probes for select. Our systematic data structure requires $r = O(n \log\sigma)$
bits of redundancy whereas the non-systematic data structure of [12] uses just
$O(n)$ bits of redundancy. However, if the latter should also provide $O(1)$-time
access to the encoded string, then its redundancy becomes $O(n \log\sigma)$. Note
that eq. (1) is targeted for non-constant alphabet size $\sigma$ whereas, for constant
size, the lower and upper bounds for the $\sigma = 2$ case of [8] can be extended to
obtain a matching bound of $\Theta(\frac{n \log t}{t})$ bits (see Appendix A.1).

The conceptual separation of the index from the input data was introduced
to prove lower bounds in [7]. It was then explicitly employed for upper bounds in
[6, 13, 20], and was fully formalized in [1]. The latter contains the best known
upper bounds for our problem^5, i.e. $O(s)$ probes for select and $O(s \log k)$ probes
for rank, for any two parameters $s \le \log\sigma/\log\log\sigma$ and $k \le \sigma$, with
redundancy $O(n \log k + n(1/s + 1/k)\log\sigma)$. For example, fixing $s = k = \log\log\sigma$,
they obtain $O(\log\log\sigma)$ probes for select and $O(\log\log\sigma \log^{(3)}\sigma)$ probes for
rank, with redundancy $O(n \log\sigma/\log\log\sigma)$. By eq. (1), we get the same
redundancy with $t = O(\log\log\sigma)$ probes for both rank and select. Hence, our
probe complexity for rank is usually better than [1] while that of select is
the same. Our $O(\log\log\sigma)$ running times are all better when compared to
$O((\log\log\sigma)^2 \log^{(3)}\sigma)$ for rank and $O((\log\log\sigma)^2)$ for select in [1].

4 We define $\log^{(1)} x := \log_2 x$ and, for integer $i \ge 2$, $\log^{(i)} x := \log_2(\log^{(i-1)} x)$.

2 General Technique 

This section aims at stating a general lower bound technique, of independent
interest, which applies not only to both rank and select but to other query
operations as well. Suppose we have a set S of strings of length n, and a set Q
of queries that must be supported on S using at most t probes each and an
unknown amount r of redundancy bits. Under certain assumptions on S and Q,
we can show a lower bound on r. Clearly, any choice of S and Q is allowed for
the upper bound.

Terminology. We now give a framework that relies on a simple notion of entropy
$H(S)$, where $H(X) = \lceil \log |X| \rceil$ for any class of $|X|$ combinatorial objects [4]. The
framework extends the algorithmic encoding/decoding approach [5]. Consider an
arbitrary algorithm A that can answer any query in Q performing at most
t probes on any $s \in S$, using a succinct index with r bits. We describe how to
encode s using A and the succinct index as a black box, thus obtaining $E(s)$
bits of encoding. Then, we describe a decoder that, knowing A, the index of r
bits, and the latter encoding of $E(s)$ bits, is able to reconstruct s in its original
form. The encoding and decoding procedures are allowed unlimited (but finite)
computing time, recalling that A can make at most t probes per query.

The lower bound on r arises from the necessary condition $\max_{s \in S} E(s) + r \ge H(S)$,
since otherwise the decoder cannot be correct. Namely, $r \ge H(S) - \max_s E(s)$:
the lower $E(s)$, the tighter the lower bound for r. Our contribution
is to give conditions on S and Q so that the above approach can hold for a
variety of query operations, and is mostly oblivious of the specific operation at
hand, since only the query access pattern to s is relevant. This appears to be novel.

First, we require S to be sufficiently dense, that is, $H(S) \ge n \log\sigma - \Theta(n)$.
Second, Q must be a subset of $[\sigma] \times [n]$, so that the first parameter specifies a
character c and the second one an integer p. Elements of Q are written as $q_{c,p}$.
Third, answers to queries must be within [n]. The set Q must contain a number
of stumbling or z-unique queries, as we define now. Consider an execution of A
on a query $q_{c,p} \in Q$ for a string s. The set of accessed positions in s, expressed
as a subset of [n], is called an access pattern, and is denoted by $\mathrm{Pat}_s(q_{c,p})$.

5 We compare ourselves with the improved bounds given in the full version of [1]. 
First, stumbling queries imply the occurrence of a certain symbol c inside
their own access pattern: the position of c can be decoded by using just the
answer and the parameters of the query. Formally, $q_{c,p} \in Q$ is stumbling if there
exists a computable function f that takes as input c, p and the answer of $q_{c,p}$
over s, and outputs a position $x \in \mathrm{Pat}_s(q_{c,p})$ such that $s[x] = c$. The position
x is called the target of $q_{c,p}$. The rationale is that the encoder does not need
to store any information regarding $s[x] = c$, since x can be extracted by the
decoder from f and the at most t positions probed by A. We denote by $Q'_s \subseteq Q$
the set of stumbling queries over s.

Second, z-unique queries are at the heart of our technique, where z is a
positive integer. Informally, they have specific answers implying unique occurrences
of a certain symbol c in a segment of s of length z + 1. Formally, a set U of
answers is z-unique if for every query $q_{c,p}$ having answer in U, there exists a unique
$i \in [p, p+z]$ such that $s[i] = c$ (i.e. $s[j] \ne c$ for all $j \in [p, p+z]$, $j \ne i$). A query
$q_{c,p}$ having answer in U is called z-unique and the corresponding position i is
called the target of $q_{c,p}$. Note that, for our purposes, we will restrict to the cases
where $H(U) = O(n)$. The rationale is the following: when the decoder wants to
rebuild the string it must generate queries, execute them, and test whether they
are z-unique by checking if their answers are in U. Once that happens, it can
infer a position i such that $s[i] = c$, even though such a position is not probed
by the query. We denote by $Q''_s(z) \subseteq Q \setminus Q'_s$ the set of z-unique queries over s
that are not stumbling. We also let $\mathrm{Tgt}_s(q_{c,p})$ denote the target of query $q_{c,p}$
over s, if it exists, and let $\mathrm{Tgt}_s(Q) = \bigcup_{q \in Q} \mathrm{Tgt}_s(q)$ for any set of queries Q.
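The two notions of target can be illustrated with a small checker that, unlike the decoder, sees the whole string. The function names and the sample pattern are hypothetical; the window $[p, p+z]$ is clipped at the end of s, which is an assumption the text does not spell out.

```python
def z_unique_target(s, c, p, z):
    """Unique i in [p, p+z] with s[i] == c, or None if there is no
    occurrence or more than one (window clipped to the end of s)."""
    last = min(p + z, len(s) - 1)
    hits = [i for i in range(p, last + 1) if s[i] == c]
    return hits[0] if len(hits) == 1 else None

def stumbling_target(s, c, p, answer, pattern, f):
    """x = f(c, p, answer) if x is a probed position holding c, else None."""
    x = f(c, p, answer)
    return x if x in pattern and s[x] == c else None

s = [0, 2, 1, 3, 1, 0, 2, 3]
print(z_unique_target(s, 3, 1, 3))   # s[1..4] holds exactly one 3
print(z_unique_target(s, 1, 1, 3))   # s[1..4] holds two 1s

# A select-like query whose answer is itself the target (f is the identity
# on the answer); the access pattern {4, 5} is made up for the example.
identity = lambda c, p, answer: answer
print(stumbling_target(s, 0, 1, 5, {4, 5}, identity))
```

In the lower bound argument the decoder never inspects s directly; it only learns that a query is z-unique from its answer, or recovers x through f.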

Main statement. We now state our main theorem. Let S be a set of strings such
that $H(S) \ge n \log\sigma - \Theta(n)$. Consider a set of queries Q that can be answered
by performing at most t probes per query and using r bits of redundancy.

Theorem 1. For any $z \in [\sigma]$, let $\lambda(z) = \min_{s \in S} |\mathrm{Tgt}_s(Q'_s) \cup \mathrm{Tgt}_s(Q''_s(z))|$.
Then, there exist integers $\gamma$ and $\delta$ with $\min\{\lambda(z), n\}/(15t) \le \gamma + \delta \le \lambda(z)$,
such that any succinct index has redundancy

    $r \ge \gamma \log\frac{\sigma}{z} + \delta \log\frac{\sigma\delta}{t|Q|} - \Theta(n)$.

The proof goes through a number of steps, each dealing with a different issue,
and is deferred to Section 3.

Applications. We now apply Theorem 1 to our two main problems, for an
alphabet size $\sigma \le n$.

Theorem 2. Any algorithm solving rank queries on a string $s \in [\sigma]^n$ using
at most $t = o(\log\sigma)$ character probes (i.e. access queries) requires a succinct
index with $r = \Omega(\frac{n \log\sigma}{t})$ bits of redundancy.

Proof. We start by defining the set S of strings. For the sake of presentation,
suppose $\sigma$ divides n. An arbitrary string $s \in S$ is the concatenation of $n/\sigma$
permutations of $[\sigma]$. Note that $|S| = (\sigma!)^{n/\sigma}$ and so we have $H(S) \ge n \log\sigma - \Theta(n)$
bits (by Stirling's approximation).
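The density claim for this class of strings can be checked numerically. The sketch below computes $H(S) = \lceil \log |S| \rceil$ exactly with integer arithmetic and compares it with $n \log\sigma$; the sample values of n and $\sigma$ are arbitrary, and the bound $n \log_2 e$ on the gap follows from $\sigma! \ge (\sigma/e)^\sigma$.

```python
import math

def entropy_bits(count):
    """H(X) = ceil(log2 |X|) for a class of |X| >= 1 objects."""
    return (count - 1).bit_length()

def permutation_class_entropy(n, sigma):
    """H(S) when S holds all concatenations of n/sigma permutations."""
    assert n % sigma == 0
    return entropy_bits(math.factorial(sigma) ** (n // sigma))

n, sigma = 1024, 64
h = permutation_class_entropy(n, sigma)
gap = n * math.log2(sigma) - h      # the Theta(n) loss w.r.t. n log sigma
print(h, gap, gap / n)
```

For these values the gap stays below $n \log_2 e$ bits, in line with the $\Theta(n)$ term in the statement.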
Without loss of generality, we prove the bound on a derivation of the rank
problem. We define the set Q and fix the parameter $z = \sigma^{3/4}\sqrt{t}$, so that the
queries are $q_{c,p} = \mathrm{rank}(c, p+z) - \mathrm{rank}(c, p)$, where $c \in [\sigma]$ and $p \in [n]$ with
$p \bmod z = 0$. In this setting, the z-unique answers are in $U = \{1\}$. Indeed,
whenever $q_{c,p} = 1$, there exists just one instance of c in $s[p, p+z]$. Note that
$|Q| = n\sigma/z > n$, for $\sigma$ larger than some constant.

Observe that $\lambda(z) \ge n$, as each position i in s such that $s[i] = c$ is the target
of exactly one query $q_{c,p}$: supposing the query is not stumbling, such a query is
surely z-unique. By Theorem 1, $\gamma + \delta \ge n/(30t)$ since a single query is allowed
to make up to 2t probes now. (This causes just a constant multiplicative factor
in the lower bound.)
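The query set of this proof can be tried out on a small instance. The sketch below assumes, for simplicity, that z divides $\sigma$, so that every window of length z lies inside a single permutation; the proof's actual choice $z = \sigma^{3/4}\sqrt{t}$ need not satisfy this. The answer is computed as the number of occurrences of c in $s[p, p+z-1]$, which is what the rank difference counts.

```python
import random

def rank(s, c, p):
    """Occurrences of c in s[0, p-1]."""
    return sum(1 for x in s[:p] if x == c)

def window_query(s, c, p, z):
    """q_{c,p} = rank(c, p+z) - rank(c, p)."""
    return rank(s, c, min(p + z, len(s))) - rank(s, c, p)

def random_permutation_string(n, sigma, rng):
    """Concatenation of n/sigma random permutations of [sigma]."""
    out = []
    for _ in range(n // sigma):
        block = list(range(sigma))
        rng.shuffle(block)
        out.extend(block)
    return out

rng = random.Random(0)
n, sigma, z = 48, 8, 4               # z divides sigma (an assumption)
s = random_permutation_string(n, sigma, rng)

queries = [(c, p) for c in range(sigma) for p in range(0, n, z)]
answers = {q: window_query(s, q[0], q[1], z) for q in queries}

# Every position i is covered by the query (s[i], z * (i // z)).
covered = sum(1 for i, c in enumerate(s) if answers[(c, z * (i // z))] == 1)
print(len(queries), covered)
```

Here $|Q| = n\sigma/z$ and every position of s is the target of exactly one query with answer 1, mirroring the counting used for $\lambda(z)$.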

Having met all requirements, we apply Theorem 1, and get

    $r \ge \gamma \log\frac{\sigma}{z} - \delta \log\frac{nt}{z\delta} - \Theta(n)$.    (2)

We distinguish between two cases. If $\delta \le n/\sigma^{1/4}$, then
$\delta \log\frac{nt}{z\delta} \le \frac{n}{\sigma^{1/4}} \log\frac{t\sigma^{1/4}}{z} \le \frac{n}{2\sigma^{1/4}} \log(t/\sigma)$,
since $\delta \log(1/\delta)$ is monotone increasing in $\delta$ as
long as $\delta \le \lambda(z)/2$ (and $n/\sigma^{1/4} \le \lambda(z)/2$ for sufficiently large $\sigma$). Hence, recalling
that $t = o(\log\sigma)$, the absolute value of the second term on the right hand side of (2)
is $o(n/t)$ for $\sigma$ larger than a constant. Moreover, $\gamma \ge n/(30t) - \delta \ge n/(60t)$ in
this setting, so that the bound in (2) reduces to

    $r \ge \frac{n}{240t} \log\sigma - \frac{n}{120t} \log t - \Theta(n) = \frac{n}{240t} \log\sigma - \Theta(n)$.

In the other case, we have $\delta > n/\sigma^{1/4}$, and
$\delta \log\frac{nt}{z\delta} \le \delta \log\frac{\sigma^{1/4} t}{z} = \frac{\delta}{2} \log(t/\sigma)$.
Therefore, we know in (2) that
$\gamma \log(\sigma/z) + (\delta/2) \log(\sigma/t) \ge \frac{1}{2}(\gamma + \delta) \log(\sigma/z)$,
as we chose $z \ge t$. Again, we obtain

    $r \ge \frac{n}{240t} \log\sigma - \Theta(n)$.

In both cases, the $\Theta(n)$ term is negligible as $t = o(\log\sigma)$, hence the bound. □

Theorem 3. Any algorithm solving select queries on a string $s \in [\sigma]^n$ using
at most $t = o(\log\sigma)$ character probes (i.e. access queries) requires a succinct
index with $r = \Omega(\frac{n \log\sigma}{t})$ bits of redundancy.

Proof. The set S of strings is composed of full strings, assuming that $\sigma$ divides n.
A full string contains each character exactly $n/\sigma$ times and, differently from
Theorem 2, has no restrictions on where they can be found. Again, we have
$H(S) \ge n \log\sigma - \Theta(n)$.

The set Q of queries is $q_{c,p} = \mathrm{select}(c, p)$, where $p \in [n/\sigma]$, and all queries
in Q are stumbling ones, as $\mathrm{select}(c, i) = x$ immediately implies that $s[x] = c$
(so f is the identity function). There are no z-unique queries here, so we can
fix any value of z: we choose z = 1. It is immediate to see that $\lambda(z) = n$, and
$|Q| = n$, as there are only $n/\sigma$ queries for each symbol in $[\sigma]$. By Theorem 1,
we know that $\gamma + \delta \ge n/(15t)$. Hence, the bound is

    $r \ge \gamma \log\sigma + \delta \log\frac{\sigma\delta}{t|Q|} - \Theta(n) \ge \frac{n}{15t} \log(\sigma/t^2) - \Theta(n)$.

Again, as $t = o(\log\sigma)$ the latter term is negligible and the bound follows. □



3 Proof of Theorem 1 



We give an upper bound on $E(s)$ for any $s \in S$ by describing an encoder and
a decoder for s. In this way we can use the relation $\max_{s \in S} E(s) + r \ge H(S)$
to derive the claimed lower bound on r (see Section 2). We start by discussing
how we can use z-unique and stumbling queries to encode a single position and
its content compactly. Next, we will deal with conflicts between queries: not all
queries in Q are useful for encoding. We describe a mechanical way to select a
sufficiently large subset of Q so that conflicts are avoided. Bounds on $\gamma$ and $\delta$
arise from such a process. To complete the encoding, we present how to store
the parameters of the queries that the decoder must run.

Entropy of a single position and its content. We first evaluate the entropy
of positions and their contents by exploiting the knowledge of z-unique and
stumbling queries. We use the notation $H(S|\Omega)$ for some event $\Omega$ as a shortcut
for $H(S')$ where $S' = \{s \in S \mid s \text{ satisfies } \Omega\}$.

Lemma 1. For any $z \in [\sigma]$, let $\Omega_{c,p}$ be the condition "$q_{c,p}$ is z-unique". Then
it holds that $H(S) - H(S|\Omega_{c,p}) \ge \log(\sigma/z) - O(1)$.

Proof. Note that the set $(S|\Omega_{c,p}) = \{s \in S : \mathrm{Tgt}_s(q_{c,p}) \text{ is defined on } s\}$ for a given
query $q_{c,p}$. It is $|(S|\Omega_{c,p})| \le (z+1)\sigma^{n-1}$ since there are at most z + 1 candidate
target cells compatible with $\Omega_{c,p}$ and at most $|S|/\sigma$ possible strings
containing c at a fixed position. So, $H(S|\Omega_{c,p}) \le \log(z+1) + H(S) - \log\sigma$, hence
the bound. □

Lemma 2. Let $\Omega'_{c,p}$ be the condition "$q_{c,p}$ is a stumbling query". Then, it holds
that $H(S) - H(S|\Omega'_{c,p}) \ge \log(\sigma/t) - O(1)$.

Proof. The proof for this situation is already known from [9]. In our notation,
the proof goes along the same lines as that of Lemma 1, except that we have
t choices instead of z + 1. To see that, let $m_1, m_2, \ldots, m_t$ be the positions, in
temporal order, probed by the algorithm A on s while answering $q_{c,p}$. Since the
query is stumbling, the target will be one of $m_1, \ldots, m_t$. It suffices to remember
which one of the t steps probes that target, since the values $m_1, \ldots, m_t$ are
deterministically characterized given A, s, $q_{c,p}$. □

Conflict handling. In general, multiple instances of Lemma 1 and/or Lemma 2
cannot be applied independently. We introduce the notion of conflict on the
targets and show how to circumvent this difficulty. Two queries $q_{b,o}$ and $q_{c,p}$
conflict on s if at least one of the following three conditions holds:
(i) $\mathrm{Tgt}_s(q_{c,p}) \in \mathrm{Pat}_s(q_{b,o})$, (ii) $\mathrm{Tgt}_s(q_{b,o}) \in \mathrm{Pat}_s(q_{c,p})$,
(iii) $\mathrm{Tgt}_s(q_{c,p}) = \mathrm{Tgt}_s(q_{b,o})$. A set of
queries where no one conflicts with another is called conflict free. The next
lemma is similar to the one found in [10], but the context is different.

Lemma 3 defines a lower bound on the maximum size of a conflict free subset
of Q. We use an iterative procedure that maintains at each ith step a set $Q^*_i$ of
conflict free queries and a set $C_i$ of available targets, such that no query q whose
target is in $C_i$ will conflict with any query $q' \in Q^*_{i-1}$. Initially, $C_0$ contains all
targets for the string s, so that by definition $|C_0| \ge \lambda(z)$. Also, $Q^*_0$ is the empty
set.

Lemma 3. Let $i \ge 1$ be an arbitrary step and assume $|C_{i-1}| > 2|C_0|/3$. Then,
there exist $Q^*_i$ and $C_i$ such that (a) $|Q^*_i| = 1 + |Q^*_{i-1}|$, (b) $Q^*_i$ is conflict free,
(c) $|C_i| \ge |C_0| - 5it \ge \lambda(z) - 5it$.

Proof. We first prove that there exists $u \in C_{i-1}$ such that no more than 3t
queries probe u. Assume by contradiction that for any u, at least 3t queries
probe u. Then, we would collect $3t|C_{i-1}| > 2|C_0|t$ probes in total. However,
any query can probe at most t cells, summing up to $|C_0|t$, giving a
contradiction. At step i, we choose u as a target, say, of query $q_{c,p}$ for some c, p. This
maintains invariant (a) as $Q^*_i = Q^*_{i-1} \cup \{q_{c,p}\}$. As for invariant (b), we remove
the potentially conflicting targets from $C_{i-1}$, and produce $C_i$. Let $I_u \subseteq C_{i-1}$
be the set of targets for queries probing u over s, where by the above
properties $|I_u| \le 3t$. We remove u and the elements in $I_u$ and $\mathrm{Pat}_s(q_{c,p})$. So,
$|C_i| = |C_{i-1}| - |\{u\}| - |I_u| - |\mathrm{Pat}_s(q_{c,p})| \ge |C_{i-1}| - 1 - 3t - t \ge |C_0| - 5it$. □

By applying Lemma 3 until $|C_i| \le 2|C_0|/3$, we obtain a final set $Q^*$, hence the
following:

Corollary 1. For any $s \in S$, $z \in [\sigma]$, there exists a set $Q^*$ containing z-unique
and stumbling queries of size $\gamma + \delta \ge \min\{\lambda(z), n\}/(15t)$, where
$\delta = |\{q \in Q^* \mid q \text{ is stumbling on } s\}|$ and $\gamma = |\{q \in Q^* \mid q \text{ is z-unique on } s\}|$.
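The selection procedure behind Lemma 3 can be sketched as a greedy loop. This is an illustration under simplifying assumptions, not the paper's exact procedure: each query is given with one known target and its access pattern, the target picked at each step is the available one probed by the fewest queries (rather than any target probed by at most 3t queries), and the dictionaries `tgt` and `pat` are hypothetical inputs.

```python
def conflict(a, b, tgt, pat):
    """Conditions (i)-(iii) for two distinct queries a and b."""
    return tgt[a] in pat[b] or tgt[b] in pat[a] or tgt[a] == tgt[b]

def probing(u, pat):
    """Queries whose access pattern contains position u."""
    return [q for q in pat if u in pat[q]]

def conflict_free_subset(tgt, pat):
    available = set(tgt.values())          # C_0: every target position
    initial = len(available)
    chosen = []                            # Q*, grown one query per step
    while available and 3 * len(available) > 2 * initial:
        # Pick the available target probed by the fewest queries.
        u = min(sorted(available), key=lambda x: len(probing(x, pat)))
        q = next(r for r in sorted(tgt) if tgt[r] == u)
        chosen.append(q)
        # Remove u, the targets of queries probing u, and q's own pattern.
        available -= {u} | {tgt[r] for r in probing(u, pat)} | set(pat[q])
    return chosen

# Toy instance: query i targets position i and probes the next two cells.
n = 30
tgt = {i: i for i in range(n)}
pat = {i: {(i + 1) % n, (i + 2) % n} for i in range(n)}
chosen = conflict_free_subset(tgt, pat)
pairs = [(a, b) for a in chosen for b in chosen if a < b]
print(chosen, any(conflict(a, b, tgt, pat) for a, b in pairs))
```

On this toy instance with t = 2 the loop stops after three steps, which is consistent with the $n/(15t)$ guarantee of Corollary 1.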

Encoding. We are left with the main task of describing the encoder. Ideally,
we would like to encode the targets, each with a cost as stated in Lemma 1 and
Lemma 2, for the conflict free set $Q^*$ mentioned in Corollary 1. Characters in
the remaining positions can be encoded naively as a string. This approach has a
drawback. While encoding which queries in Q are stumbling has a payoff when
compared to Lemma 2, we do not have such a guarantee for z-unique queries when
compared to Lemma 1. Without getting into details, depending on the choice of
the parameters $|Q|$, z and t, such an encoding sometimes saves space and sometimes
does not: it may use even more space than $H(S)$. For example, when $|Q| = O(n)$,
even the naive approach works and yields an effective lower bound. Instead, if
Q is much larger, savings are not guaranteed. The main point here is that we
want to overcome such a dependence on the parameters and always guarantee a
saving, which we obtain by means of an implicit encoding of z-unique queries.
Some machinery is necessary to achieve this goal.
Archetype and trace. Instead of trying to directly encode the information
of $Q^*$ as discussed above, we find a query set $Q^A$ called the archetype of $Q^*$,
that is indistinguishable from $Q^*$ in terms of $\gamma$ and $\delta$. The extra property of
$Q^A$ is to be decodable using just $O(n)$ additional bits, hence $E(s)$ is smaller
when $Q^A$ is employed. The other side of the coin is that our solution requires
a two-step encoding. We need to introduce the concept of trace of a query $q_{c,p}$
over s, denoted by $\mathrm{Trace}_s(q_{c,p})$. Given the access pattern
$\mathrm{Pat}_s(q_{c,p}) = \{m_1 < m_2 < \cdots < m_t\}$ (see Section 2), the trace is defined as the string
$\mathrm{Trace}_s(q_{c,p}) = s[m_1] \cdot s[m_2] \cdots s[m_t]$. We also extend the concept to sets of queries, so that for
$\hat{Q} \subseteq Q$, we have $\mathrm{Pat}_s(\hat{Q}) = \bigcup_{q \in \hat{Q}} \mathrm{Pat}_s(q)$, and $\mathrm{Trace}_s(\hat{Q})$ is defined using the
sorted positions in $\mathrm{Pat}_s(\hat{Q})$.
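Access patterns and traces can be observed by running a query against a recording view of s. The sketch below is illustrative only: a "query" is modelled as any callable that probes the view, and the two sample probe sequences are made up.

```python
class Recorder:
    """Read-only view of s that records which positions get probed."""
    def __init__(self, s):
        self._s = s
        self.probed = set()
    def access(self, i):
        self.probed.add(i)
        return self._s[i]

def pattern_and_trace(s, query):
    """Run query(view) on s; return (sorted access pattern, trace)."""
    view = Recorder(s)
    query(view)
    positions = sorted(view.probed)
    return positions, [s[m] for m in positions]

def set_pattern_and_trace(s, queries):
    """Pattern of a query set is the union; trace uses sorted positions."""
    union = set()
    for query in queries:
        union |= set(pattern_and_trace(s, query)[0])
    positions = sorted(union)
    return positions, [s[m] for m in positions]

s = [2, 0, 1, 0, 2, 0]
q1 = lambda v: (v.access(4), v.access(1))   # hypothetical probe sequences
q2 = lambda v: (v.access(1), v.access(5))
print(pattern_and_trace(s, q1))
print(set_pattern_and_trace(s, [q1, q2]))
```

Note that the trace lists symbols by increasing position, not in the temporal order of the probes.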

Then, we define a canonical ordering between query sets. We define the
predicate $q_{c,p} \prec q_{d,g}$ iff $p < g$ or ($p = g \wedge c < d$) over queries, so that we can
sort queries inside a single query set. Let $Q_1 = \{q_1 \prec q_2 \prec \cdots \prec q_x\}$ and let
$Q_2 = \{q'_1 \prec q'_2 \prec \cdots \prec q'_y\}$ be two distinct query sets. We say that $Q_1 \prec Q_2$ iff
either $q_1 \prec q'_1$ or recursively $(Q_1 \setminus \{q_1\}) \prec (Q_2 \setminus \{q'_1\})$.
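The canonical ordering maps naturally onto lexicographic comparison. The sketch below represents a query $q_{c,p}$ as the pair (c, p) and assumes that the recursive case applies when the two smallest queries coincide and that the compared sets have the same size, as is the case when candidate archetypes are compared; both are assumptions of this illustration.

```python
def query_key(q):
    """q = (c, p); q_{c,p} precedes q_{d,g} iff p < g, or p == g and c < d."""
    c, p = q
    return (p, c)

def canonical(query_set):
    """Queries of one set, sorted by the predicate above."""
    return sorted(query_set, key=query_key)

def precedes(q1_set, q2_set):
    """Compare two distinct query sets element by element in sorted order."""
    a = [query_key(q) for q in canonical(q1_set)]
    b = [query_key(q) for q in canonical(q2_set)]
    return a < b     # Python compares lists lexicographically

print(canonical({(3, 0), (1, 2), (0, 2), (5, 1)}))
print(precedes({(1, 0), (2, 5)}, {(1, 0), (3, 5)}))
```

Sorting by the key (p, c) orders queries first by position and then by character, exactly as the predicate prescribes.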

Given $Q^*$, its archetype $Q^A$ obeys the following conditions for the given s:

- it is conflict free and has the same number of queries as $Q^*$;

- it contains exactly the same stumbling queries as $Q^*$, and all remaining
queries are z-unique (note that they may differ from those in $Q^*$);

- if $p_1, p_2, \ldots, p_x$ are the positional arguments of queries in $Q^*$, then the same
positions are found in $Q^A$ (while the characters $c_1, c_2, \ldots, c_x$ may change);

- $\mathrm{Pat}_s(Q^*) = \mathrm{Pat}_s(Q^A)$;

- among those query sets complying with the above properties, it is the
minimal w.r.t. the canonical ordering $\prec$.

Note that $Q^*$ complies with all the conditions above but the last. Therefore,
the archetype of $Q^*$ always exists, being either a smaller query set (w.r.t. $\prec$)
or $Q^*$ itself. The encoder can compute $Q^A$ by exhaustive search, since its time
complexity is not relevant to the lower bound.

First step: encoding for trace and stumbling queries. As noted above, the
stumbling queries for $Q^*$ and $Q^A$ are the same, and there are $\delta$ of them. Here,
we encode the trace together with the set of stumbling queries. The rationale is
that the decoder must be able to rebuild the original trace only, whilst encoding
of the positions which are not probed is left to the next step, together with
z-unique queries. Here is the list of objects to be encoded in order:

(a) The set of stumbling queries expressed as a subset of Q.

(b) The access pattern $\mathrm{Pat}_s(Q^A)$ encoded as a subset of [n], the positions of s.

(c) The reduced trace, obtained from $\mathrm{Trace}_s(Q^A)$ by removing all the characters
in positions that are targets of stumbling queries. Encoding is performed
naively by storing each character using $\log\sigma$ bits. The positions thus
removed, relative to the trace, are stored as a subset of $[|\mathrm{Trace}_s(Q^A)|]$.

(d) For each stumbling query $q_{c,p}$, in the canonical order, an encoded integer i
of $\log t$ bits indicating that the ith probe accesses the target of the query.
The decoder starts with an empty string. It reads the access pattern in (b) and
the set of removed positions in (c), and distributes the contents of the reduced
trace (c) into the remaining positions. In order to fill the gaps in (c), it recovers
the stumbling queries in (a) and runs each of them, in canonical ordering. Using
the information in (d), as proved by Lemma 2, it can discover the target in which
to place its symbol c. Since $Q^A$ is conflict free, we are guaranteed that each query
will always find a symbol in the probed positions.

Lemma 4. Let $\ell$ be the length of $\mathrm{Trace}_s(Q^A)$. The first step encodes information
(a)-(d) using at most $\ell \log\sigma + O(n) + \delta \log(|Q|/\delta) - \delta \log(\sigma/t)$ bits.

Proof. Space occupancy for all objects: (a) uses $\log\binom{|Q|}{\delta} = \delta \log(|Q|/\delta) + O(\delta)$ bits;
(b) uses $\log\binom{n}{\ell} \le n$ bits; (c) uses $(\ell - \delta) \log\sigma$ bits for the reduced trace plus at
most $\ell$ bits for the removed positions; (d) uses $\delta \log t$ bits. □
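The accounting in the proof of Lemma 4 can be checked numerically on sample values. The sketch below adds up the four contributions and compares the total with the main terms of the lemma; the parameter values are hypothetical, and the removed positions in (c) are charged as $\log\binom{\ell}{\delta}$ bits, which is at most $\ell$.

```python
import math

def log2_binom(a, b):
    """log2 of the binomial coefficient C(a, b)."""
    return math.log2(math.comb(a, b))

def first_step_bits(n, sigma, t, num_queries, ell, delta):
    """Bits used by objects (a)-(d), following the proof of Lemma 4."""
    a = log2_binom(num_queries, delta)       # (a) which queries stumble
    b = log2_binom(n, ell)                   # (b) access pattern within [n]
    c = ((ell - delta) * math.log2(sigma)    # (c) reduced trace ...
         + log2_binom(ell, delta))           #     ... plus removed positions
    d = delta * math.log2(t)                 # (d) probe index per query
    return a + b + c + d

def lemma4_main_terms(sigma, t, num_queries, ell, delta):
    """ell log sigma + delta log(|Q|/delta) - delta log(sigma/t)."""
    return (ell * math.log2(sigma)
            + delta * math.log2(num_queries / delta)
            - delta * math.log2(sigma / t))

n, sigma, t = 4096, 256, 4
num_queries, ell, delta = 4096, 600, 150     # hypothetical sample values
used = first_step_bits(n, sigma, t, num_queries, ell, delta)
slack = used - lemma4_main_terms(sigma, t, num_queries, ell, delta)
print(round(used), round(slack), slack / n)
```

The difference between the two quantities is the part absorbed by the $O(n)$ term of the lemma; it consists of (b), the removed positions of (c), and the $O(\delta)$ loss in (a).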

Second step: encoding of z-unique queries and unprobed positions. We
now proceed to the second step, where targets for z-unique queries are encoded
along with the unprobed positions. They can be rebuilt using queries in $Q^A$. To
this end, we assume that the encoding of Lemma 4 has already been performed and,
during decoding, we assume that the trace has been already rebuilt. Recall that
$\gamma$ is the number of z-unique queries. Here is the list of objects to be encoded:

(e) The set of queries in $Q^A$ that are z-unique, expressed as a subset of $Q^A$
according to the canonical ordering $\prec$. Also the set of z-unique answers U
is encoded as a subset of [n].

(f) For each z-unique query $q_{c,p}$, in canonical order, the encoded integer p. This
gives a multiset of $\gamma$ integers in [n].

(g) The reduced unprobed region of the string, obtained by removing all the
characters in positions that are targets of z-unique queries. Encoding is
performed naively by storing each character using $\log\sigma$ bits. The positions thus
removed, relative to the unprobed region, are stored as a subset of $[n - \ell]$.

(h) For each z-unique query $q_{c,p}$, in the canonical order, an encoded integer i of
$\log z + O(1)$ bits indicating which position in $[p, p+z]$ contains c.

The decoder first obtains $Q^A$ by exhaustive search. It initializes a set of
$|Q^A|$ empty couples (c, p) representing the arguments of each query in canonical
order. It reads (e) and reuses (a) to obtain the parameters of the stumbling
queries inside $Q^A$. It then reads (f) and fills all the positional arguments of the
queries. Then, it starts enumerating all query sets in canonical order that are
compatible with the arguments known so far. That is, it generates characters
for the arguments of z-unique queries, since the rest is known. Each query set is
then tested in the following way. The decoder executes each query by means of
the trace. If the execution tries a probe outside the access pattern, the decoder
skips to the next query set. If the query conflicts with any other query inside the
same query set, the decoder skips. If the query answer denotes that the query is
not z-unique (see Section 2 and (e)), it skips. In this way, all the requirements
for the archetype are met, hence the first query set that is not skipped is $Q^A$.
Using $Q^A$ the decoder rebuilds the characters in the missing positions of the
reduced unprobed region: it starts by reading positions in (g) and using them to
distribute the characters in the reduced region encoded by (g) again. For each
z-unique query $q_{c,p} \in Q^A$, in canonical order, the decoder reads the corresponding
integer i inside (h) and infers that $s[i + p] = c$. Again, conflict freedom ensures
that all queries can be executed and the process can terminate successfully. Now,
the string s is rebuilt.

Lemma 5. The second step encodes information (e)-(h) using at most
$(n - \ell) \log\sigma + O(n) - \gamma \log(\sigma/z)$ bits.

Proof. Space occupancy: (e) uses $\log\binom{|Q^A|}{\gamma} \le |Q^A|$ bits for the subset plus,
recalling from Section 2, $O(n)$ bits for U; (f) uses $\log\binom{n+\gamma}{\gamma} \le 2n$ bits; (g)
requires $(n - \ell - \gamma) \log\sigma$ bits for the reduced unprobed region plus $\log\binom{n-\ell}{\gamma}$ bits
for the positions removed; (h) uses $\gamma \log z + O(\gamma)$ bits. □

Proof (of Theorem 1). By combining Lemma 4 and Lemma 5 we obtain that
for each $s \in S$, $E(s) \le n \log\sigma + O(n) + \delta \log\frac{t|Q|}{\delta\sigma} - \gamma \log\frac{\sigma}{z}$. We know that
$r + \max_{s \in S} E(s) \ge H(S) \ge n \log\sigma - \Theta(n)$, hence the bound follows. □
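The combination of the two lemmas can be spelled out term by term. The derivation below only rearranges the bounds of Lemma 4 and Lemma 5 as stated in the text, with $\delta$ counting stumbling queries and $\gamma$ counting z-unique ones.

```latex
\begin{align*}
E(s) &\le \underbrace{\ell\log\sigma + O(n) + \delta\log\frac{|Q|}{\delta}
          - \delta\log\frac{\sigma}{t}}_{\text{Lemma 4}}
        + \underbrace{(n-\ell)\log\sigma + O(n)
          - \gamma\log\frac{\sigma}{z}}_{\text{Lemma 5}} \\
     &= n\log\sigma + O(n) + \delta\log\frac{t\,|Q|}{\delta\,\sigma}
        - \gamma\log\frac{\sigma}{z}, \\[4pt]
r    &\ge H(S) - \max_{s\in S} E(s)
      \ge \gamma\log\frac{\sigma}{z}
        + \delta\log\frac{\delta\,\sigma}{t\,|Q|} - \Theta(n).
\end{align*}
```

The $\ell \log\sigma$ terms of the two steps cancel, so the final bound does not depend on how many positions were probed.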

4 Upper bounds 

Our approach follows substantially the one in [1], but uses two new ingredients,
that of monotone hashing [3] and succinct SB-trees [14], to achieve an improved
(and in many cases optimal) result. We first consider these problems in a slightly
different framework and give some preliminaries.

Preliminaries. We are given a subset $T \subseteq [\sigma]$, where $|T| = m$. Let
$R(i) = |\{j \in T \mid j < i\}|$ for any $i \in [\sigma]$, and $S(i)$ be the $(i+1)$st element of T, for any
$i \in [m]$.

The value of $S(R(p))$ for any p is named the predecessor of p inside T. For any
subset $T \subseteq [\sigma]$, given access to $S(\cdot)$, a succinct SB-tree [14] is a systematic data
structure that supports predecessor queries on T, using $O(|T| \log\log\sigma)$ extra
bits. For any $c > 0$ such that $|T| = O(\log^c \sigma)$, the succinct SB-tree supports
predecessor queries in $O(c)$ time plus $O(c)$ calls to $S(\cdot)$. The data structure
relies on a precomputed table of $n^{1-\Omega(1)}$ bits depending only on $\sigma$, and not on T.

A monotone minimal perfect hash function for $T$ is a function $h_T$ such that
$h_T(x) = R(x)$ for all $x \in T$, but $h_T(x)$ can be arbitrary if
$x \notin T$. We need the following result:

Theorem 4 ([3]). There is a monotone minimal perfect hash function for $T$
that either:

• occupies $O(m \log\log\sigma)$ bits and can be evaluated in $O(1)$ time; or

• occupies $O(m \log^{(3)}\sigma)$ bits and can be evaluated in
$O(\log\log\sigma)$ time.

Although function $R(\cdot)$ has been studied extensively in the case that $T$
is given explicitly, we consider the situation where $T$ can only be accessed
through (expensive) calls to $S(\cdot)$. We also wish to minimize the space used
(so, e.g., creating an explicit copy of $T$ in a preprocessing stage, and then
applying existing solutions, is ruled out). We give the following extension of
known results:

Lemma 6. Let $T \subseteq [\sigma]$ and $|T| = m$. Then, for any
$1 \le k \le \log\log\sigma$, there is a data structure that supports $R(\cdot)$
in $O(\log\log\sigma)$ time plus $O(1 + \log k)$ calls to $S(\cdot)$, and uses
$O((m/k)\log\log\sigma)$ bits of space. The data structure uses a pre-computed
table (independent of $T$) of size $\sigma^{1-\Omega(1)}$ bits.

Proof. We construct the data structure as follows. We store every
$(\log\sigma)$th element of $T$ in a y-fast trie [22]. This divides $T$ into
buckets of $\log\sigma$ consecutive elements. For any bucket $B$, we store every
$k$th element of $T$ in a succinct SB-tree. The space usage of the y-fast trie
is $O(m)$ bits, and that of the succinct SB-tree is $O((m/k)\log\log\sigma)$
bits.

To support $R(\cdot)$, we first perform a query on the y-fast trie, which takes
$O(\log\log\sigma)$ time. We then perform a query in the appropriate bucket,
which takes $O(1)$ time by looking up a pre-computed table (which is independent
of $T$) of size $\sigma^{1-\Omega(1)}$. The query in the bucket also requires
$O(1)$ calls to $S(\cdot)$. We have so far computed the answer within $k$ keys
in $T$: to complete the query for $R(\cdot)$ we perform binary search on these
$k$ keys using $O(\log k)$ calls to $S(\cdot)$.
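
The last step of this proof, locating the answer among at most $k$ keys with $O(\log k)$ calls to $S(\cdot)$, can be illustrated by the Python sketch below. It is only a sketch under simplifying assumptions: a plain sorted list of every $k$th key stands in for the y-fast trie and the succinct SB-trees, so it does not reproduce the space or time bounds of Lemma 6, and the class name `SampledRank` is hypothetical.

```python
from bisect import bisect_left

class SampledRank:
    """Answer R(x) = |{j in T : j < x}| when T is reachable only through S."""

    def __init__(self, S, m, k):
        self.S, self.m, self.k = S, m, k
        # Keep every k-th key explicitly. This sorted list is a stand-in for
        # the y-fast trie and the succinct SB-trees used in the proof.
        self.samples = [S(i) for i in range(0, m, k)]

    def rank(self, x):
        j = bisect_left(self.samples, x)   # number of sampled keys below x
        if j == 0:
            return 0                       # S(0) >= x, or T is empty
        # The answer lies among the keys strictly between two adjacent samples.
        lo, hi = (j - 1) * self.k + 1, min(j * self.k, self.m)
        while lo < hi:                     # binary search, one S call per step
            mid = (lo + hi) // 2
            if self.S(mid) < x:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

With $k = 1$ every key is sampled and no call to $S(\cdot)$ is needed at query time; larger $k$ trades stored samples for extra calls.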

Supporting rank and select. In what follows, we use Lemma 6 choosing $k = 1$
and $k = \log\log\sigma$. We now show the following result, contributing to
eq. (1). Note that the first option in Theorem 5 has optimal index size for $t$
probes, for $t \le \log\sigma/\log\log\sigma$. The second option has optimal
index size for $t$ probes, for $t \le \log\sigma/\log^{(3)}\sigma$, but only
for select.

Theorem 5. For any $1 \le t \le \sigma$, there exist data structures with the
following complexities:

(a) select in $O(t)$ probes and $O(t)$ time, and rank in $O(t)$ probes and
$O(t + \log\log\sigma)$ time using a succinct index with
$r = O(n(\log\log\sigma + (\log\sigma)/t))$ bits of redundancy. If
$\sigma = (\log n)^{O(1)}$, the rank operation requires only $O(t)$ time.

(b) select in $O(t)$ probes and $O(t\log\log\sigma)$ time, and rank in
$O(t\log^{(3)}\sigma)$ probes and $O(t\log\log\sigma\,\log^{(3)}\sigma)$ time,
using $r = O(n(\log^{(3)}\sigma + (\log\sigma)/t))$ bits of redundancy for the
succinct index.

Proof. We divide the given string $s$ into contiguous blocks of size $\sigma$
(assume for simplicity that $\sigma$ divides $n = |s|$). As in [1, 12], we use
$O(n)$ bits of space, and incur an additive $O(1)$-time slowdown, to reduce the
problem of supporting rank and select on $s$ to the problem of supporting these
operations on a given block $B$. We denote the individual characters of $B$ by
$B[0], \ldots, B[\sigma - 1]$.

Our next step is also as in [1]: letting $n_c$ denote the multiplicity of
character $c$ in $B$, we store the bitstring
$Z = 1^{n_0} 0 1^{n_1} 0 \ldots 1^{n_{\sigma-1}} 0$, which is of length
$2\sigma$, and augment it with the binary rank and select operations, using
$O(\sigma)$ bits in all. Let $c = B[i]$ for some $0 \le i \le \sigma - 1$, and
let $\pi[i]$ be the position of $c$ in a stably sorted ordering of the
characters of $B$ ($\pi$ is a permutation). As in [1], select$(c, \cdot)$ is
reduced, via $Z$, to determining $\pi^{-1}(j)$ for some $j$. As shown in [16],
for any $1 \le t \le \sigma$, permutation $\pi$ can be augmented with
$O(\sigma + (\sigma\log\sigma)/t)$ bits so that $\pi^{-1}(j)$ can be computed in
$O(t)$ time plus $t$ evaluations of $\pi(\cdot)$ for various arguments.
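
One well-known way to obtain a trade-off of this shape is to place back pointers every $t$ steps along each cycle of the permutation. The Python sketch below illustrates that idea; whether it matches the construction of [16] in detail is an assumption, the function names are hypothetical, and a dictionary replaces the succinct encoding of marks and pointers.

```python
def build_back_pointers(pi, t):
    """Mark every t-th element of each cycle longer than t and link each mark
    to the previous mark on the same cycle."""
    n = len(pi)
    seen, back = [False] * n, {}
    for start in range(n):
        if seen[start]:
            continue
        cycle, x = [], start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = pi[x]
        if len(cycle) > t:
            marks = cycle[::t]
            for q, node in enumerate(marks):
                back[node] = marks[q - 1]  # q = 0 wraps around to the last mark
    return back

def inverse(pi, back, j):
    """Return the i with pi[i] == j using O(t) evaluations of pi."""
    x, jumped = j, False
    while True:
        if not jumped and x in back:
            # Jump once to the previous mark, at most t steps behind this one.
            x, jumped = back[x], True
        if pi[x] == j:
            return x
        x = pi[x]
```

The walk reaches a mark within $t$ steps, jumps back once, and then needs at most $t$ further steps to reach the element that maps to $j$; roughly $\sigma/t$ pointers are stored.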

If $T_c$ denotes the set of indexes in $B$ containing the character $c$, we
store a minimal monotone hash function $h_{T_c}$ on $T_c$, for all
$c \in [\sigma]$. To compute $\pi(i)$, we probe $s$ to find $c = B[i]$, and
observe that $\pi(i) = R(i) + \sum_{l=0}^{c-1} n_l$, where $R$ is taken with
respect to $T_c$. The latter term is obtained in $O(1)$ time by rank and select
operations on $Z$, and the former term by evaluating $h_{T_c}(i)$. By Theorem 4,
the complexity of select$(c, i)$ is as claimed.
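
The reduction of select on a block to $\pi$ and $Z$ can be traced on a toy block with the Python sketch below. All names are hypothetical, occurrences are counted from 1 as in the definition of select, and naive scans replace the monotone hash function, the rank/select structure on $Z$, and the inverse-permutation structure, so only the identities are illustrated, not the bounds.

```python
from itertools import accumulate

def build_block_index(B, sigma):
    """Return (Z, counts, first) for a block B over the alphabet [sigma]."""
    counts = [0] * sigma
    for c in B:
        counts[c] += 1
    # Z = 1^{n_0} 0 1^{n_1} 0 ... 1^{n_{sigma-1}} 0
    Z = "".join("1" * n_c + "0" for n_c in counts)
    # first[c] = n_0 + ... + n_{c-1}; the paper reads this off Z in O(1) time.
    first = [0] + list(accumulate(counts))[:-1]
    return Z, counts, first

def pi(B, first, i):
    """Position of B[i] in the stable sort of B."""
    c = B[i]
    R = sum(1 for j in range(i) if B[j] == c)  # naive stand-in for h_{T_c}(i)
    return R + first[c]

def select_in_block(B, counts, first, c, j):
    """Position of the j-th occurrence (j >= 1) of c in B, or -1."""
    if j < 1 or j > counts[c]:
        return -1
    target = first[c] + j - 1
    # Naive inversion of pi; the paper uses the structure of [16] instead.
    return next(i for i in range(len(B)) if pi(B, first, i) == target)

B = [2, 0, 1, 2, 0, 2]
Z, counts, first = build_block_index(B, 3)
```

For this block, the values $\pi(0), \ldots, \pi(5)$ form a permutation, and the answer to each select query is one application of $\pi^{-1}$.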

As noted above, supporting rank$(c, i)$ on $s$ reduces to supporting rank on
an individual block $B$. If $T_c$ is as above, we apply Lemma 6 to each $T_c$,
once with $k = 1$ and once with $k = \log\log\sigma$. Lemma 6 requires some
calls to $S(\cdot)$, but this is just select$(c, \cdot)$ restricted to $B$, and
is solved as described above. If $\sigma = (\log n)^{O(1)}$, then
$|T_c| = (\log n)^{O(1)}$, and we store $T_c$ itself in the succinct SB-tree,
which allows us to compute $R(\cdot)$ in $O(1)$ time using a (global, shared)
lookup table of size $n^{1-\Omega(1)}$ bits. □

The enhancements described here also lead to more efficient non-systematic data
structures. Namely, for $\sigma = \Theta(n^{\varepsilon})$,
$0 < \varepsilon < 1$, we match the lower bound of [10, Theorem 4.3]. Moreover,
we improve asymptotically both in terms of space and time over the results
of [1]:

Corollary 2. There exists a data structure that represents any string $s$ of
length $n$ using $nH_k(s) + O(n\log\sigma/\log\log\sigma)$ bits, for any
$k = o(\log_\sigma n/\log\log\sigma)$, supporting access in $O(1)$ time, rank
and select in $O(\log\log\sigma)$ time.

Proof. We take the data structure of Theorem 5(a), where
$r = O(n\log\sigma/\log\log\sigma)$. We compress $s$ using the high-order
entropy encoder of [6, 13, 20] resulting in an occupancy of $nH_k(s) + a$ bits,
where $H_k(s)$ is the $k$th-order empirical entropy and $a$ is the extra space
introduced by encoding. We have
$a = O(\frac{n}{\log_\sigma n}(k\log\sigma + \log\log n))$, which is
$O(n\log\sigma/\log\log\sigma)$ for our choice of $k$, hence it does not
dominate the data structure redundancy. Operation access is immediately provided
in $O(1)$ time by the encoded structure, thus the time complexity of Theorem 5
applies. □

A Appendix



A.1 Extending previous work

In this section, we prove a first lower bound for rank and select operations. We
extend the existing techniques of [8], originally targeted at $\sigma = 2$. The
bound has the advantage of holding for any $1 \le t \le n/2$, but it is weaker
than eq. (1) when $\log t = o(\log\sigma)$.

Theorem 6. Let $s$ be an arbitrary string of length $n$ over the alphabet
$\Sigma = [\sigma]$, where $\sigma \le n$. Any algorithm solving rank or select
queries on $s$ using at most $t$ character probes (i.e. access queries), where
$1 \le t \le n/2$, requires a succinct index with $r = \Omega((n\log t)/t)$
bits of redundancy.

Intuitively speaking, the technique is as follows: it first creates a set of
queries the data structure must answer and then partitions the strings into
classes, driven by the algorithm's behaviour. A bound on the entropy of each
class gives the bound. However, our technique proves that finding a set of
queries adaptively for each string can give a higher bound for
$t = o(\log\sigma)$.

Before getting into the full details we prove a technical lemma that is based
on the concept of distribution of characters in a string: Given a string $T$ of
length $u$ over alphabet $\phi$, the distribution (vector) $d$ for $u$ over
$\phi$ is a vector in $\mathbb{N}^{\phi}$ containing the frequency of each
character in $T$. We can state:

Lemma 7. For any $\phi \ge 2$, $u \ge \phi$ and distribution $d$ for $u$ on
$\phi$, it holds



Proof. The maximization follows from the concavity of the multinomial function
and the uniqueness of its maximum: the maximum is located at the uniform
distribution $d = (u/\phi, u/\phi, \ldots, u/\phi)$. The upper bound arises from
the double Stirling inequality, as we have:



and the lemma follows. 

Let $L = 3\sigma t$ and assume for sake of simplicity that $L$ divides $n$. We
start by defining the query set

$Q = \{\mathrm{select}(c, 3ti) \mid c \in [\sigma] \wedge i \in [n/L]\}$

having size $\gamma = n\sigma/L = n/(3t)$. The set of strings on which we
operate, $S$, is designed so that all queries in $Q$ return a position in the
set. We build strings by concatenating $n/L$ chunks, each of which is generated
in all possible ways. A single chunk is built by aligning $3t$ occurrences of
each symbol in $[\sigma]$ and then permuting the resulting substring of length
$L$ in any possible way.
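
The construction of the hard strings and of the query set can be made concrete with the Python sketch below, which draws one random string from the family. It assumes, as an interpretation not stated explicitly in the text, that the occurrence index passed to select is counted from 0, so that the queries with index $3ti$ land in chunk $i$; the names `build_string` and `select0` are hypothetical.

```python
import random

def build_string(sigma, t, chunks, rng):
    """Concatenate `chunks` chunks, each a shuffle of 3t copies of every symbol."""
    s = []
    for _ in range(chunks):
        chunk = [c for c in range(sigma) for _ in range(3 * t)]
        rng.shuffle(chunk)
        s.extend(chunk)
    return s

def select0(s, c, j):
    """Position of the occurrence of c with 0-based index j, or -1 if absent."""
    seen = 0
    for pos, sym in enumerate(s):
        if sym == c:
            if seen == j:
                return pos
            seen += 1
    return -1

sigma, t, chunks = 4, 2, 5
L = 3 * sigma * t
s = build_string(sigma, t, chunks, random.Random(0))
n = len(s)
Q = [(c, 3 * t * i) for i in range(n // L) for c in range(sigma)]
answers = {q: select0(s, *q) for q in Q}
# Smallest answer among the sigma queries that share the same index i.
block_starts = [min(answers[(c, 3 * t * i)] for c in range(sigma))
                for i in range(n // L)]
```

Under this reading every query in $Q$ has an answer, $|Q| = n/(3t)$, and the smallest answer for index $i$ is the first position of chunk $i$, whatever shuffle was drawn.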

A choices tree for $Q$ is a composition of smaller decision trees. At the top,
we build the full binary tree of height $r$, each leaf representing a possible
choice for the values of the index bits. For each leaf of this tree, we append
the decision tree of our algorithm for the first query $Q_1$ on every possible
string conditioned on the choice of the index. The decision tree has height at
most $t$ and each node has fan-out $\sigma$, being all possible results of
probing a location of the string. Each node is labeled with the location the
algorithm chooses to analyze; however, we are not interested in this
information. The decision tree has now $2^r \sigma^t$ leaves. At each leaf we
append the decision tree for the second query $Q_2$, increasing the number of
leaves again, and so on up to $Q_\gamma$. Without loss of generality we will
assume that all decision trees have height exactly $t$ and that each location is
probed only once (otherwise we simply remove double probes and add some padding
ones in the end). Leaves at the end of the whole decision tree are assigned
strings from $S$ which are compatible with the root-to-leaf path: each path
defines a set of answers $A$ for all $\gamma$ queries and a string is said to be
compatible with a leaf if the answers to $Q$ on that string are exactly $A$ and
all probes during the path match the path. For any leaf $x$, we will denote the
number of compatible strings by $C(x)$. Note that the tree partitions the entire
set of strings, i.e. $\sum_{x \text{ is a leaf}} C(x) = |S|$. Our objective is
to prove that $C(x)$ cannot be too big, and so prove that to distinguish all the
answer sets the topmost tree must have at least some minimum height. More in
detail, we will first compute $C^*$, an upper bound on $C(x)$ for any $x$, and
then use the following relation to obtain the bound:

$\log|S| = \log \sum_{x \text{ is a leaf}} C(x) \le \log(\#\text{ of leaves}) + \log C^* \le r + t\gamma\log\sigma + \log C^*$   (3)

Before continuing, we define some notation. For any path, the number of probed
locations is $t\gamma = n/3$, while the number of unprobed locations is denoted
by $U$. We divide a generic string in some leaf $x$ into consecutive blocks of
characters defined depending on the answer set to $Q$ for that leaf, as follows.
The set $S_i \subseteq Q$ of $\sigma$ queries is defined as
$S_i = Q \cap \{(c, x) \in [n] \times [\sigma] \mid x = i\}$; we define the
block $B_i$ as the interval $[\min_c A_x(S_i), (\min_c A_x(S_{i+1})) - 1]$
(where $A_x$ defines the answer to a set of queries), i.e. the maximum span
covered by the answer set to $S_i$ in some leaf $x$. Note that the partitioning
in blocks is dependent only on $Q$, i.e. it is typical of a leaf and not of a
specific string, and that due to our particular choice of $S$, the length
$|B_i|$ is exactly $L$. Thus, the number of blocks is $n/L$.

We now associate a conceptual value $U_i$ to each block, which represents the
number of unprobed characters in that block, so that
$\sum_{i=1}^{n/L} U_i = U$. As in a leaf of the choices tree all probed
locations have the same values, the only degree of freedom distinguishing
compatible strings between themselves lies in the unprobed locations. We will
compute $C^*$ by analyzing single blocks, and we will focus on the right side of
the following:

$\frac{C^*}{\sigma^U} = c_1^* c_2^* \cdots c_{n/L}^* = \frac{g_1}{\sigma^{U_1}} \frac{g_2}{\sigma^{U_2}} \frac{g_3}{\sigma^{U_3}} \cdots \frac{g_{n/L}}{\sigma^{U_{n/L}}}$

where $g_i \le \sigma^{U_i}$ represents the possible assignment of unprobed
characters for block $i$ and $c_i^*$ the ratio $g_i/\sigma^{U_i}$.

We categorize blocks into two classes: determined blocks, having
$U_i < \sigma t$, and the remaining undetermined ones. For determined ones, we
will assume $g_i = \sigma^{U_i}$. For the remaining ones we upper bound the
possible choices by their maximum value, i.e. we employ Lemma 7 to bound their
entropy. Joining it with $U_i \ge \sigma t$ we obtain:

cr/2 



- < < (j 



rV2 



lX a/2 
t 



The last step involves finding the number of such determined and undetermined
blocks. As the number of global probes is at most $t\gamma = n/3$, the maximum
number of determined blocks (where the number of probed locations is
$L - U_i > 2\sigma t$) is $(t\gamma)/(2\sigma t) = n/(2L)$. The number of
undetermined blocks is then at least $n/L - n/(2L) = n/(2L)$. Recalling that our
upper bound increases with the number of determined blocks, we keep it to the
minimum. Therefore, we have:

logC . < ulog „ + Jl£ log (I) + Jl I log „ . e (2 log (i)) ( 5) 

Joining Equations (5) and (3) with the fact that $t\gamma + U = n$, we obtain that
n log a — — = log | S\ < r + t-f log a + U log a — O y — log t J 

and the bound follows. 
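
The block counting used in this derivation is plain arithmetic on $n$, $\sigma$ and $t$, and can be checked with exact fractions as in the Python sketch below. The function name `block_counts` and the sample parameters are illustrative assumptions.

```python
from fractions import Fraction

def block_counts(n, sigma, t):
    """Return (L, gamma, probes, max_determined, min_undetermined)."""
    L = 3 * sigma * t                          # block length
    gamma = Fraction(n, 3 * t)                 # number of queries in Q
    probes = t * gamma                         # total probes, equal to n/3
    # A determined block has more than 2*sigma*t probed locations.
    max_determined = probes / (2 * sigma * t)  # equal to n/(2L)
    min_undetermined = Fraction(n, L) - max_determined
    return L, gamma, probes, max_determined, min_undetermined

L, gamma, probes, max_det, min_undet = block_counts(120, 4, 2)
```

For these parameters there are $n/L = 5$ blocks, and both the maximum number of determined blocks and the minimum number of undetermined ones equal $n/(2L) = 5/2$.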

We can prove an identical result for operation rank. The set $S$ of hard strings
is the set of all strings of length $n$ over $[\sigma]$. We conceptually divide
the strings in blocks of $L = 3\sigma t$ consecutive positions, starting at 0.
With this in mind, we define the set of queries

$Q = \{\mathrm{rank}(c, iL) \mid c \in [\sigma] \wedge i \in [n/L]\}$,

i.e. we ask for the distribution of the whole alphabet every $L$ characters,
resulting in a batch of $\gamma = n/(3t)$ queries. The calculations are then
parallel to the previous case. □